Abstract

We propose a non-convex training objective 
for robust binary classification of data sets in 
which label noise is present. The design is 
guided by the intention of solving the result- 
ing problem by adiabatic quantum optimiza- 
tion. Two requirements are imposed by the 
engineering constraints of existing quantum 
hardware: training problems are formulated
as quadratic unconstrained binary optimiza- 
tion; and model parameters are represented 
as binary expansions of low bit-depth. In the 
present work we validate this approach by us- 
ing a heuristic classical solver as a stand-in 
for quantum hardware. Testing on several 
popular data sets and comparing with a num- 
ber of existing losses we find substantial ad- 
vantages in robustness as measured by test 
error under increasing label noise. Robust- 
ness is enabled by the non-convexity of our 
hardware-compatible loss function, which we 
name q-loss. 



1. Introduction 

In recent years machine learning researchers and prac- 
titioners have been focusing on convex optimization 
methods due to their computational advantages and 
well understood mathematical properties. The many 
successes of convexity-based algorithms are witnesses 
to that. While it is easily recognized that allowing for
non-convex objectives opens up a plethora of possibili- 
ties for better solutions to machine learning problems, 
much of the contemporary research has deliberately 
avoided them. The reason is the widely known fact 
that non-convexity often results in NP-hard problems. 

However this choice comes at a cost as the shortcom- 
ings of convex objectives are also well understood. Re- 
cent work (Long & Servedio, 2010) showed that convex 
loss functions cannot be made robust in the presence 
of label noise because they cause unbounded growth of 
penalties for large negative margins. (Manwani & Sas- 
try, 2011) further characterized this effect by analyzing 
various convex losses and found that none of them is 
tolerant to non-uniform label noise. In practice label 
noise turns out to be a serious problem due to the fact 
that it affects real-world data sets to a significant de- 
gree. Since label noise manifests itself throughout the 
optimization as large negative margins, the finally con- 
structed decision hyperplane that represents the global 
minimum of any convex loss tends to be pulled by the 
mislabeled training examples away from the minimizer 
of classification error. Therefore, even though solving 
convex losses to optimality is feasible, when label noise 
causes the lowest objective value to not correspond to 
the lowest attainable training error, the entire exer- 
cise misses the mark. Consequently, any approach ex- 
hibiting this problem does not stand to benefit from 
improved optimization techniques. 

For example, Fig. 1 shows the broken correspondence 
between training error and objective value when a con- 
vex loss is used in a training problem of practical 
significance: "OCR in photos".

Figure 1. Relationship between training error and inverse
empirical risk produced by minimizing square loss on six
different binary classifiers for digits (e.g. '1' vs the rest,
'2' vs the rest, etc.). The data ("OCR in photos"; 10200
dimensions; 38924 examples; 10 classes) represents a challenging
real-world training problem of significant practical
importance. An adequate loss function should generally
decrease training error as the empirical risk approaches its
global minimum (top plots). Unfortunately, the opposite
effect (bottom plots) can often be observed when working
with convex losses. The failures are found to be due to
two factors, both of which cause square loss to be drastically
misled by its convexity: occasionally mistaken labels
resulting from the semi-automatic process generating the
data; and the presence of examples of one class that may
be similar to examples of another class (e.g. '6' and '8').

The human task of tagging characters in photos of potentially
poor quality is not easy, so the presence of mislabeled examples
in the training set is not surprising. Even worse,
routinely used semi-automatic preparation of training 
data is also contributing to mistakes. The problem 
may gradually disappear for cleaner data sets, which 
often happen to be the cases when convex losses pro- 
duce excellent classifiers. Unfortunately the nature of 
large-scale supervised learning does not permit elab- 
orate quality assurance for data sets that are handed 
out to training algorithms; accordingly label noise will 
continue to pollute real-world data sets. Moreover fu- 
ture intelligent systems will rely increasingly on weakly 
labeled or unlabeled data increasing the need for noise 
tolerance. 

(Ding & Vishwanathan, 2010) and (Masnadi-Shirazi 
et al., 2010) took these lessons and independently studied
two different non-convex but seemingly well-behaved
types of loss functions. (Collobert et al., 2006;
Ertekin et al., 2011) also explored non-convexity in 
the context of SVM with ramp loss, but their focus 
was on achieving sparser sets of support vectors and 
speed of training rather than improved accuracy and 
robustness of the constructed classifier. 

In the present work we continue the study of non- 
convexity. We report on training with a non-convex 
objective using discrete optimization in a formulation
adapted to take advantage of emerging hardware
that performs adiabatic quantum optimization
(AQO). AQO, first proposed in (Farhi et al., 2000), is
a quantum computing model with good prospects for 
scalable and practically useful hardware implementa- 
tion. Studies of its purported computational superior- 
ity over classical computing have repeatedly given en- 
couraging results, e.g. (Dickson & Amin, 2011). Sig- 
nificant investments are underway by the Canadian 
company D-Wave to develop a hardware implementa- 
tion. A series of rigorous studies of the quantum me- 
chanical properties of the D-Wave processors, culmi- 
nating in a recent Nature publication (Johnson et al., 
2011), have increased the excitement in the quantum 
computing community for this approach. This was 
further fueled by news of a successful collaboration 
with Google (Neven et al., 2009a) and of Lockheed 
Martin purchasing an adiabatic quantum computer. 
For machine learning purposes, D-Wave's implementation
of AQO can be regarded as a black-box discrete
optimization engine that accepts any problems for- 
mulated as quadratic unconstrained binary optimiza- 
tion (QUBO), also equivalent to the Ising model and 
Weighted MAX-2-SAT. It should be noted that this 
training formulation is a good format for AQO inde- 
pendently of D-Wave's efforts since it can be phys- 
ically realized as the simplest possible multi-qubit 
configuration — an Ising system (Brush, 1967). We do 
not claim principled superiority of q-loss over other 
non-convex losses. However, q-loss is distinguished by
the fact that it can be formulated for AQO on quan- 
tum hardware that only supports quadratic (2-local) 
interactions among its qubits using a number of ancil- 
lary variables that just grows linearly with the num- 
ber of training examples. To the best of our current 
knowledge, no other non-convex loss has this property
(footnote 1). While all other non-convex losses are tackled
by heuristic optimization with very limited success, q- 
loss may be solvable to optimality by AQO. 

The paper is organized as follows: Section 2 defines 
the training problem; Section 3 introduces q-loss, derives
its QUBO formulation, and discusses the intuition
behind it; Sections 4 and 5 deal with choosing
hyper-parameter values and discretization of variables; 
Section 6 presents our experiments; and Section 7 con- 
cludes with an overview and discussion. Technical de- 
tails can be found in the supplementary material. 

1 Except for the non-margin-enforcing 0-1 loss (Neven 
et al., 2009b) 
2. Training a binary classifier

We study binary classifiers $y = \mathrm{sign}(w^T x + b)$, where
$x \in \mathbb{R}^N$ is an input pattern to be classified, $y \in \{-1, 1\}$
is the label associated with $x$, $w \in \mathbb{R}^N$ is a vector
of weights to be optimized, and $b \in \mathbb{R}$ is the
bias. Training, also known as regularized risk minimization,
consists of choosing $w$ and $b$ by simultaneously
minimizing two terms: empirical risk
$R(w, b) = \frac{1}{S} \sum_{s=1}^{S} L(m(x_s, y_s, w, b))$ and regularization $\Omega(w)$.
$R$, via a loss function $L$, estimates the error that any
candidate classifier causes over a set of $S$ training examples
$\{(x_s, y_s) \mid s = 1, \ldots, S\}$. The argument of $L$ is
known as the margin of example $s$ with respect to the
decision hyperplane defined by $w$ and $b$:

$$m(x_s, y_s, w, b) = y_s \left( w^T x_s + b \right) \qquad (1)$$
$\Omega$ controls the complexity of the classifier and is necessary
for good generalization because classifiers with
high complexity display overfitting: they can classify
the training set with low error but may not do well on
previously unseen data. Training amounts to solving

$$(w, b)^* = \arg\min_{w, b} \left\{ R(w, b) + \Omega(w) \right\} \qquad (2)$$



The most natural choice for L is 0-1 loss, which simply 
indicates a misclassification for a negative margin: 



$$L_{0\text{-}1}(m) = \left(1 - \mathrm{sign}(m)\right) / 2 \qquad (3)$$



Due to the non-convexity of $L_{0\text{-}1}$, the resulting optimization
problem (2) is NP-hard (Feldman et al.,
2010). To avoid dealing with NP-hard optimization
problems, in practice $L_{0\text{-}1}$ is replaced by some convex
upper bound (e.g. square, logistic, exponential,
hinge), and $\Omega$ is usually chosen as $\ell_1$- or $\ell_2$-norm penalization
of $w$. This allows arriving at convex optimization
problems that can be rigorously analyzed and
efficiently solved by classical means. However, such re- 
laxations are known to compromise the original goal of 
training because convex losses can be severely misled 
by label noise in the training data. 
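
The quantities defined in this section can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: the helper names and the toy one-dimensional data are hypothetical and do not come from the experiments. It shows how a single mislabeled example with a large negative margin dominates the empirical risk under square loss while contributing a bounded amount under 0-1 loss.

```python
import numpy as np

def margins(X, y, w, b):
    """Margin m_s = y_s * (w^T x_s + b) for every training example, as in (1)."""
    return y * (X @ w + b)

def zero_one_loss(m):
    """0-1 loss of (3): 1 for a negative margin, 0 for a positive one."""
    return (1.0 - np.sign(m)) / 2.0

def square_loss(m):
    """Square loss, one of the convex upper bounds on 0-1 loss."""
    return (m - 1.0) ** 2

def empirical_risk(loss, X, y, w, b):
    """Average loss over the S training examples."""
    return float(np.mean(loss(margins(X, y, w, b))))

# Toy 1-D data (illustrative values only); the last label is deliberately wrong.
X = np.array([[2.0], [1.0], [-1.0], [-6.0]])
y = np.array([1.0, 1.0, -1.0, 1.0])
w, b = np.array([1.0]), 0.0

risk_01 = empirical_risk(zero_one_loss, X, y, w, b)  # one error out of four
risk_sq = empirical_risk(square_loss, X, y, w, b)    # dominated by the outlier
print(risk_01, risk_sq)
```

With these toy values the mislabeled point alone contributes a square loss of 49, which is the unbounded-penalty behavior that the following sections set out to avoid.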

3. q-loss

Because the quantum hardware natively represents a 
general family of quadratic functions, the simplest loss 
function that would work is square loss, which is a 
convex upper bound to $L_{0\text{-}1}$:

$$L_{\mathrm{square}}(m) = (m - 1)^2 \qquad (4)$$



However, there are two drawbacks of square loss when
applied to binary classification. First, in binary classification
it does not make sense to penalize large positive
margins. Second, as mentioned earlier, square
loss has the same flaw as all convex losses: penalties
for large negative margins grow unboundedly, which
can cause non-robustness with respect to label noise.

Figure 2. Top: q-loss for different values of q (legend: q = -3,
q = -2, q = -1). Middle: q-loss with three members of the
quadratic upper bounds family; $t \in \mathbb{R}$ is the variational
parameter. Bottom: Transforming the y-axis for concavity.

With these considerations in mind, we modify square 
loss in order to obtain a training formulation for binary 
classification that is both compatible with quantum 
hardware and robust to label noise. The resulting loss, 
which we name q-loss (Fig. 2, top), is essentially a 
doubly truncated version of (4) with parameterization 
over $q \in (-\infty, 0]$ defined as follows:

Definition 1 (q-loss)

$$L_q(m) = \min\left[ (1 - q)^2, \left(\max(0, 1 - m)\right)^2 \right] \qquad (5)$$
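
Definition 1 translates directly into code. The following is a minimal sketch assuming NumPy; the function name `q_loss` and the sample margins are ours and purely illustrative.

```python
import numpy as np

def q_loss(m, q):
    """q-loss of Definition 1: min((1 - q)^2, max(0, 1 - m)^2), with q <= 0."""
    if q > 0:
        raise ValueError("q must lie in (-inf, 0]")
    m = np.asarray(m, dtype=float)
    return np.minimum((1.0 - q) ** 2, np.maximum(0.0, 1.0 - m) ** 2)

# Illustrative margins: far negative, at q, inside the parabola, at 1, far positive.
sample_margins = np.array([-10.0, -1.0, 0.0, 1.0, 5.0])
values = q_loss(sample_margins, q=-1.0)
print(values)
```

For q = -1 the penalty saturates at (1 - q)^2 = 4 for every margin below q, and it vanishes for every margin of at least 1, which are the two truncations described above.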



Unfortunately, (5) does not lead to a QUBO. How- 
ever, it turns out that we can transform it into a prob- 
lem which can be solved as a QUBO. The basic idea 
is to find a variational approximation via a family of 
quadratic functions that upper-bound q-loss and are 
governed by a variational parameter $t \in \mathbb{R}$ as shown
in Fig. 2, middle. 
Theorem 2 q-loss in (5) is equivalent to:

$$L_q(m) = \min_t \left\{ (m - t)^2 + (1 - q)^2 \, \frac{1 - \mathrm{sign}(t - 1)}{2} \right\}$$



Proof Since q-loss is non-convex, the standard derivation
via convex duality (Jordan et al., 1999) dictates
that we first find a new coordinate system in which q-loss
is concave or convex. Then we calculate the conjugate
function for linear bounds in the transformed
space and transform back to the original space where
the linear bounds become the quadratic bounds shown
in Fig. 2, middle. Because of the presence of two
constant segments in q-loss, any coordinate system
in which the two axes are independent transformations
of the original x and y axes clearly cannot result
in concavity or convexity. Thereby we are led
to the transformation $f(y) = y - x^2$, which gives
$f(L_q(m)) = L_q(m) - m^2$. It can be seen (Fig. 2,
bottom) that in this transformed space q-loss is concave
and the quadratic upper bounds become tangent
lines. The conjugate function in the transformed space
is $g(\eta) = \min_m \{\eta m - f(L_q(m))\}$.

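To minimize, we seek stationary points by differentiating
$\phi(\eta, m) = \eta m - f(L_q(m))$ with respect to $m$:

$$\frac{d}{dm} \phi(\eta, m) = \eta - \frac{d}{dm} L_q(m) + 2m \qquad (6)$$

$$\phantom{\frac{d}{dm} \phi(\eta, m)} = \eta + \begin{cases} 2 & \text{for } m \in (q, 1) \\ 2m & \text{for } m \in (-\infty, q) \cup (1, \infty), \end{cases}$$

as yielded by piecewise differentiation of $L_q(m)$. Setting
(6) to zero gives the stationary points

$$\eta = -2 \quad \text{for } m \in (q, 1) \qquad (7)$$

$$m = -\eta/2 \quad \text{for } m \in (-\infty, q) \cup (1, \infty). \qquad (8)$$

Plugging them back into the conjugate function yields

$$g(\eta) = \begin{cases} -\frac{\eta^2}{4} - (1 - q)^2 & \text{for } m \in (-\infty, q) \\ -1 & \text{for } m \in (q, 1) \\ -\frac{\eta^2}{4} & \text{for } m \in (1, \infty) \end{cases} = -\frac{\eta^2}{4} - (1 - q)^2 \, \frac{1 - \mathrm{sign}(-\eta/2 - 1)}{2}. \qquad (9)$$

In accordance with convex duality,

$$f(L_q(m)) = \min_\eta \{\eta m - g(\eta)\} \qquad (10)$$

$$\phantom{f(L_q(m))} = \min_\eta \left\{ \eta m + \frac{\eta^2}{4} + (1 - q)^2 \, \frac{1 - \mathrm{sign}(-\eta/2 - 1)}{2} \right\}.$$

Transforming back into the original space and setting
$t = -\eta/2$, the variational upper bound for q-loss is

$$L_q(m) = f^{-1}(f(L_q(m))) \qquad (11)$$

$$\phantom{L_q(m)} = \min_t \left\{ (m - t)^2 + (1 - q)^2 \, \frac{1 - \mathrm{sign}(t - 1)}{2} \right\}.$$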


3.1. Latent variables view 

Traditionally when facing non-convex optimization 
problems, a viable approach is to introduce latent vari- 
ables that allow reformulating over a simpler family of 
functions. This is precisely what Theorem 2 achieves. 
For any fixed $m$, the latent variable $t \in \mathbb{R}$ gives a convex
optimization problem whose minimum is $L_q(m)$:

$$L_q(m) = h(m, t^*(m)), \quad \text{where} \quad t^*(m) = \arg\min_t \{h(m, t)\}, \qquad (12)$$

$$h(m, t) = (m - t)^2 + (1 - q)^2 \left(1 - \mathrm{sign}(t - 1)\right) / 2$$

The regularized risk minimization (2) with empirical 
risk over L q in the form (12) is amenable to a block 
coordinate descent method for jointly optimizing the 
model parameters $(w, b)$ and the latent variables $t_s$
for $s = 1, \ldots, S$: similarly to EM, alternate between
convex optimization runs over the latent variables (t 
step) and the model parameters (w step). Even though 
such methods do well on some problems with certain 
benign structure (e.g. Gaussian mixtures; Dempster
et al., 1977), they are also known to fail on other
problems that lack such structure. We believe q-loss
belongs to the latter group and have verified that a
block coordinate descent method tends to be sensitive
to initialization and to terminate quickly in bad
local minima. The intuitive reason is that due to the
quadratically growing penalty for mismatching a mar- 
gin with its latent variable, the t step tends to lock 
in the model parameters found during the previous w 
step, thus possibly preventing the next w step from 
moving to a different model. The impact of this effect 
becomes ever more severe for large data with $S \gg N$.

On the other hand, by transforming (5) into (12) we 
have made training with q-loss representable in QUBO
form albeit at the expense of additional variables. Sec- 
tion A of the supplementary material explicitly shows 
the QUBO problem that can be derived from (12). 
Since the goal of AQO is to perform global optimiza- 
tion simultaneously over all variables, we believe AQO 
is a much better candidate for training with q-loss.
Besides making the QUBO formulation possible, the 
introduction of latent variables also gives rise to an
intuitive interpretation of the mechanism by which q- 
loss achieves robustness when compared to the non- 
robustness of square loss. While in (4) the fixed target 
1 has to be matched as closely as possible by m, in 
(12) t plays the role of a flexible target that can change 
sign for a large negative margin, thereby flagging that 
training example as mislabeled. For any m, the mini- 
mizer t*(m) in (12) belongs to one of three cases: 

• Case I: $m > 1 \Rightarrow t^*(m) = m$

• Case II: $q < m < 1 \Rightarrow t^*(m) = 1$

• Case III: $m < q \Rightarrow t^*(m) = m$

Case I ensures zero penalty for large positive margins; 
Case II produces the same quadratic penalty as (4); 
Case III can be seen as flipping the label of a possi- 
bly mislabeled example but also incurring a constant 
penalty of $(1 - q)^2$ in order to not lose connection with
the original labeling. Thus, the hyper-parameter q de- 
fines the largest negative margin to be tolerated. A 
training example that has a negative margin with some 
larger magnitude gets flipped with constant penalty. 
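
The three cases can be checked numerically. The sketch below is illustrative only: the function names, the sample margins, and the grid over t are our own choices, and we assume the convention sign(0) = +1 so that t = 1 incurs no constant penalty (NumPy's own `np.sign` returns 0 at 0, hence the custom helper). It compares the closed-form minimizer with a brute-force minimization of h over a grid, and lists the examples that would be flagged as mislabeled.

```python
import numpy as np

def sign_ge(x):
    """Sign with the convention sign(0) = +1 (an assumption of this sketch)."""
    return np.where(x >= 0.0, 1.0, -1.0)

def q_loss(m, q):
    """q-loss of Definition 1."""
    return np.minimum((1.0 - q) ** 2, np.maximum(0.0, 1.0 - m) ** 2)

def h(m, t, q):
    """Quadratic upper bound of (12) for margin m and latent target t."""
    return (m - t) ** 2 + (1.0 - q) ** 2 * (1.0 - sign_ge(t - 1.0)) / 2.0

def t_star(m, q):
    """Cases I-III: t* = 1 inside (q, 1), t* = m otherwise."""
    return np.where((m > q) & (m < 1.0), 1.0, m)

q = -1.0
m = np.array([-5.0, -1.5, -0.5, 0.5, 1.0, 3.0])   # illustrative margins

closed_form = h(m, t_star(m, q), q)               # h evaluated at the minimizer
t_grid = np.linspace(-10.0, 10.0, 4001)           # brute-force search over t
grid_min = h(m[:, None], t_grid[None, :], q).min(axis=1)
flagged = m < q                                   # Case III: treated as mislabeled

print(closed_form, grid_min, flagged)
```

The closed-form values coincide with q-loss, and the grid search agrees up to the grid resolution, which is consistent with Theorem 2 on these sample points.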

4. Bounding q 

While it is difficult to formalize any general statements 
about the computational hardness of q-loss, it is easily 
recognized that the hardness depends on the size of the 
parabolic segment controlled by q. For $q \to -\infty$, even
the negative margins of highest magnitude incur the 
usual quadratic penalty, and the loss becomes effectively
convex. For q closer to 0 the loss becomes similar
to 0-1 loss, so the resulting optimization problems may 
be approaching the hardness of the corresponding 0- 
1 loss problems. However, the most beneficial regime 
of operation is not known a-priori. This necessitates 
cross-validation over q, which, depending on the noise
level, we expect to result in some trade-off between 
robustness and computational hardness. For the pur- 
pose of choosing values for cross-validation, we give an 
approximate lower bound for q as a function of our 
estimate of the underlying Bayes error in the data and 
the label noise that we might artificially insert into the 
training set for robustness evaluation. 

Let the effective Bayes error be $\beta_{\mathrm{eff}} \in [0, 0.5)$. This
should account both for the Bayes error $\beta_0$ of the data
that we are given and the additional error $\nu \in [0, 0.5)$
that we introduce by injecting label noise. Then if we
wish for the entire $\beta_{\mathrm{eff}}$ portion of the training set to
be flagged by q-loss as mislabeled, the empirical risk
is $R(w, b) \geq \beta_{\mathrm{eff}} (1 - q)^2$. But we know the trivial
solution consisting of all-zero weights has $R(0, 0) = 1$.
Then we want $\beta_{\mathrm{eff}} (1 - q)^2 < 1$, which, together
with $q \in (-\infty, 0]$, gives $q \in (1 - 1/\sqrt{\beta_{\mathrm{eff}}}, 0]$.

Usually we do not have $\beta_0$, but we can obtain an empirical
estimate by training on the given data:
$\beta_{\mathrm{emp}} = \beta_0 + \beta_{\mathrm{opt}} + \beta_{\mathrm{gen}}$, where $\beta_{\mathrm{opt}}$ is the additional error
caused by imperfect optimization, and $\beta_{\mathrm{gen}}$ represents
the generalization component of the overall test error.
Assuming $\beta_{\mathrm{emp}}$ is sufficiently close to $\beta_0$ and accounting
for the artificially introduced label noise $\nu$, we set
$\beta_{\mathrm{eff}} = \beta_{\mathrm{emp}} - 2\beta_{\mathrm{emp}}\nu + \nu$. The subtraction corrects
for originally bad examples that flip under $\nu$.
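
The bound above is simple to evaluate when choosing a cross-validation range for q. The following sketch uses hypothetical helper names and example error rates of our own choosing; it is not taken from the experimental setup.

```python
import math

def effective_bayes_error(beta_emp, nu):
    """beta_eff = beta_emp - 2 * beta_emp * nu + nu, as defined in this section."""
    return beta_emp - 2.0 * beta_emp * nu + nu

def q_lower_bound(beta_eff):
    """Open lower end of the interval (1 - 1/sqrt(beta_eff), 0] for q."""
    if not 0.0 < beta_eff < 0.5:
        raise ValueError("this sketch expects beta_eff in (0, 0.5)")
    return 1.0 - 1.0 / math.sqrt(beta_eff)

# Example values (assumptions): 10% empirical error, 20% injected label noise.
beta_eff = effective_bayes_error(beta_emp=0.10, nu=0.20)
print(beta_eff, q_lower_bound(beta_eff))
```

Candidate values of q for cross-validation would then be drawn from the interval between this lower bound and 0.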

5. Low-precision discrete variables 

The quantum optimization processor that we aim to 
deploy for solving q-loss requires problems to be discrete
and formulated as QUBO. Further, the current
hardware can handle a maximum of 512 binary vari- 
ables, which imposes the additional requirement of be- 
ing frugal with the bit-depth of weights. To that end 
we discretize the elements of w to some low bit-depth 
$d_w < 64$. While this approach is somewhat unconventional,
(Neven et al., 2008) argued there is no fundamental
reason why the weights should need high precision
and in fact showed a favorable sufficiency condition
of $d_w \approx \log(S/N)$ in the case of binary features.
Even though classifiers constructed out of more general 
features have not been studied in this way, our experi- 
ments provide support for using low-precision weights. 

The reason for fixing at 1 the smallest positive margin 
that yields zero penalty in q-loss is the same as in hinge
loss SVM (Bishop, 2006): any arbitrary rescaling of
the weights $w \to \kappa w$ and bias $b \to \kappa b$ does not change
the geometric distance $y_s (w^T x_s + b) / \|w\|$ from a data
point $(x_s, y_s)$ to the decision surface. Therefore, we
can assume a margin of 1 for the correctly classified 
point that is closest to the decision surface. However, 
this freedom of arbitrary rescaling becomes compli- 
cated when the bit-depth of weights is lowered. We 
want the intervals for weight variables to cover the 
maximum magnitude that the interplay between mar- 
gin enforcement and regularization may demand. On 
the other hand, a loose interval decreases the effective 
precision in sub-intervals that may really matter. For 
that reason we derive a $\lambda$-dependent bound for setting
the intervals in which discrete weight variables
can take values. 

Let $F(w, b) = R(w, b) + \lambda\Omega(w)$ be the objective function.
For q-loss, $F(0, 0) = R(0, 0) = 1$ and there exists $\hat{w}$ such that
$F(0, 0) = \lambda\Omega(\hat{w})$. Then,

$$F(\hat{w}, b) = R(\hat{w}, b) + \lambda\Omega(\hat{w}) \geq \lambda\Omega(\hat{w}) = F(0, 0) \qquad (13)$$

Also, $F(-\hat{w}, b) \geq F(0, 0)$ because $\Omega(-\hat{w}) = \Omega(\hat{w})$.
Now consider any $\tilde{w}$ such that $\Omega(\tilde{w}) > \Omega(\hat{w})$:

$$F(\tilde{w}, b) \geq \lambda\Omega(\tilde{w}) > \lambda\Omega(\hat{w}) = F(0, 0) \qquad (14)$$

Hence, $F(\tilde{w}, b) > F(0, 0)$ and likewise $F(-\tilde{w}, b) > F(0, 0)$,
whereas $F(\pm\hat{w}, b) \geq F(0, 0)$. Thus, we can use $\Omega(\hat{w}) = 1/\lambda$
to bound the intervals in which the weight variables
live while ensuring that the minimizer of $F$ belongs to
these intervals. For $\ell_2$-norm regularization, $\Omega(w) =
\|w\|_2 \geq \|w\|_\infty = \max(|w|)$, so we train by optimizing
$w_j \; \forall j$ only in the interval $[-\Omega(\hat{w}), \Omega(\hat{w})]$.

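The discrete optimization problem for training with
q-loss and $\ell_2$-norm regularization is:

$$(\dot{w}, \dot{b})^* = \arg\min_{\dot{w}, \dot{b}} \left\{ \frac{1}{S} \sum_{s=1}^{S} L_q\left( y_s (\dot{w}^T x_s + \dot{b}) \right) + \lambda \|\dot{w}\|^2 \right\} \qquad (15)$$

where $\dot{w}$ and $\dot{b}$ are the discretized $w$ and $b$, and $\lambda \in \mathbb{R}_{>0}$
controls the relative importance of regularization.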
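
To make the role of the weight interval concrete, the sketch below maps a low bit-depth binary string to a real weight inside $[-1/\lambda, 1/\lambda]$. The uniform grid, the multiplier and offset formulas, and all function names are assumptions of ours for illustration; the exact binary expansion used for the QUBO is the one described in the supplementary material, not this one.

```python
import numpy as np

def weight_interval(lam):
    """Half-width 1/lambda of the interval in which each weight is searched."""
    return 1.0 / lam

def multiplier_offset(lam, depth):
    """Assumed uniform grid of 2**depth values covering [-1/lam, 1/lam]."""
    half = weight_interval(lam)
    alpha = 2.0 * half / (2 ** depth - 1)   # spacing between neighboring values
    beta = -half                            # value encoded by the all-zero bit string
    return alpha, beta

def decode(bits, alpha, beta):
    """Map bits (least significant first) to the real value alpha * integer + beta."""
    powers = 2 ** np.arange(len(bits))
    return alpha * float(np.dot(bits, powers)) + beta

alpha, beta = multiplier_offset(lam=0.5, depth=3)   # 8 representable weight values
lowest = decode(np.array([0, 0, 0]), alpha, beta)
highest = decode(np.array([1, 1, 1]), alpha, beta)
print(lowest, highest)
```

A looser interval (smaller $\lambda$) spreads the same 2**depth values over a wider range, which is the loss of effective precision discussed above.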

6. Experimental evaluation 

While prior work on non-convex losses applied var- 
ious forms of convex optimization (Masnadi-Shirazi 
et al., 2010; Yuille & Rangarajan, 2002; Liu et al., 
1989) hoping they can still be solved with somewhat 
reasonable quality, we take the approach of directly 
tackling the resulting problems by discrete optimiza- 
tion. Admittedly this choice makes the optimization 
method largely oblivious to existing benign structure 
and may cause us to face NP-hardness in certain situ- 
ations. However, we do this for the purpose of being 
compatible with emerging quantum hardware that can 
be employed as a black-box discrete optimization en- 
gine having the potential to do well on such problems. 

Quantum hardware was already successfully deployed 
by (Neven et al., 2009a) on a large-scale training problem
with square loss and $\ell_0$-norm regularization. In
the present work on q-loss with $\ell_2$-norm regularization,
we only verify the validity of our approach by using 
Tabu search (Palubeckis, 2004) as a classical heuris- 
tic stand-in and leave the quantum hardware to future 
work. A quantum optimization with q-loss is expected
to achieve in shorter time equal or better results than 
our classical optimization setup. We do not report 
CPU time comparisons because they are irrelevant in 
the absence of quantum hardware runs. 

In order to show robustness, we randomly flip train- 
ing labels and observe the worsening of test error as 
a function of increasing label noise. While prior work 
on robust classification (Ding & Vishwanathan, 2010; 
Collobert et al., 2006; Freund, 2009) considered uni- 
form label noise, we note this does not adequately 
capture the essence of the true mechanism by which 
label noise trickles into real-world training tasks. In 
fact, recent work (Manwani & Sastry, 2011) shows that
even convex losses can be robust under uniform noise.
Moreover, experience with practical applications con- 
firms that the type of label noise that affects classifica- 
tion accuracy is never independent of the underlying 
data distribution. For example, if the human taggers 
preparing training data for a computer vision appli- 
cation receive somewhat inaccurate or ambiguous in- 
structions affecting only one of the classes, the result- 
ing label noise is strongly correlated with that class. 
For this reason we move to a noise model in which 
we introduce uniformly random flips only in the labels 
of one class — here WLOG of the negative class — and 
keep the labels of the other class clean. In the ex- 
periments described below, the percentage label noise 
refers to the probability with which we flip labels in 
the negative portion of training data. 
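
The class-conditional noise model described above can be sketched as follows. The function name, the seed, and the toy labels are hypothetical; only the rule itself (flip each negative label independently with the stated probability, leave positive labels untouched) comes from the text.

```python
import numpy as np

def flip_negative_labels(y, noise, rng):
    """Flip each -1 label to +1 with probability `noise`; +1 labels stay clean.

    Returns the noisy labels and a boolean mask of the injected flips.
    """
    y = np.asarray(y).copy()
    negatives = y == -1
    flips = negatives & (rng.random(y.shape[0]) < noise)
    y[flips] = 1
    return y, flips

rng = np.random.default_rng(0)                    # fixed seed for repeatability
y_clean = np.array([1, -1, -1, 1, -1, -1, 1, -1])
y_noisy, injected = flip_negative_labels(y_clean, noise=0.3, rng=rng)
print(y_noisy, injected)
```

Keeping the mask of injected flips makes it possible to compare them later with the examples that training flags as mislabeled.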

We conduct experiments on two synthetic and four 
UCI data sets (data summary in Section B of supple- 
mentary material). The synthetic data sets (Long & 
Servedio, 2010; Mease & Wyner, 2007) are designed to 
provide a stark distinction between robust and non-robust
losses. We compare the classification performance
of q-loss to seven other convex and non-convex
$\ell_2$-regularized methods: liblinear ($\ell_2$-loss primal SVM)
(Fan et al., 2008), t-logistic regression (Ding & Vish- 
wanathan, 2010), smoothed hinge loss (Zhang et al., 
2010), logistic regression, square loss, sigmoid loss, and 
probit loss (Bishop, 2006). For all methods except q-loss
and liblinear we use PETSc/TAO implementations
with convex optimization (Balay et al., 2011; Benson 
et al., 2010). We do not compare against ramp loss 
(Collobert et al., 2006), as (Ding & Vishwanathan, 
2010) already attempted it in a similar setting on the 
majority of data sets we use and were unable to pro- 
duce any salient results. Not surprisingly, this is an ex- 
ample of the inadequacy of convex optimization meth- 
ods with respect to non-convex problems. Also, we do 
not compare against 0-1 loss because it is not margin- 
enforcing. It is well known that if minimized, 0-1 loss 
yields the lowest possible training error, but due to the 
lack of margin enforcement, generalization is bad even 
when regularization is applied (Vapnik, 1998). 

With all methods we perform a standard 10-fold cross-validation
procedure (Dietterich, 1998) for locating
appropriate values of parameters affecting generalization.
Fig. 3 presents the main results with an emphasis
on the consistently superior performance of q-loss
across all data, especially at high levels of noise. We 
have verified that often in the high noise cases Tabu 
search fails to reach the lowest attainable objective 
value. Therefore we believe we are looking precisely 
at cases of computationally hard optimizations that 
fail classically but may be solved successfully by quantum
means. We note sigmoid and probit are sometimes
close competitors of q-loss but other times are
the worst performers. This can be explained by their
non-convexity, which gives them the potential for robustness,
but makes them hard to optimize reliably.
However, unlike q-loss, we do not know of any AQO-compatible
formulations for probit and sigmoid.

Figure 3. Test error vs label noise for 8 methods (see legend) on 2 synthetic data sets (Long-Servedio and Mease-Wyner)
and 4 UCI data sets (covertype, mushrooms, adult9, web8). Error bars are obtained from 10-fold cross-validation.

q-loss allows us to identify training examples with 
possibly incorrect labels as the points with m < q. 
We recorded the points whose labels we flipped before
training (injected flips) and the points that q-loss
flagged as mislabeled (trained flips). Fig. 4 summarizes
the overlaps between these two sets. The sets
of trained flips for covertype and adult9 are expect- 
edly larger due to the large Bayes error of these data 
sets. In the supplementary material we provide details 
on cross-validated hyper-parameter values (Sections C
and D) and statistical significance tests for the ob- 
served error rates (Section E). 

Figure 4. Venn diagrams showing overlap between flips injected
in the data before training (injected flips) and flips
indicated by training with q-loss (trained flips). Orange
color shows portion of injected flips recovered by q-loss.

7. Conclusion

In this paper we introduced q-loss as a robust alternative
to convex losses that suffer in the presence
of label noise. The QUBO format of the optimization
incorporating q-loss makes this version of training
an ideal candidate for applying emerging commercial 
AQO technology as the optimization method of choice. 
Moreover just by using a classical heuristic solver as 
a stand-in for hardware-based AQO, we were already 
able to show significant advantages in test error over a 
rich variety of data sets and across a number of exist- 
ing convex and non-convex losses. Our focus here was 
on formulating a robust loss that can be made com- 
patible with the engineering constraints imposed by 
emerging quantum hardware. Since with other non- 
convex losses there is no other choice but to resort 
to often failing convex optimization, q-loss stands out
with its AQO compliance. This opens up new possibil- 
ities for achieving results better than ever seen before. 
Given such encouraging results, we see great potential 
for robust classification with q-loss under AQO.

Even though (12) is a QUBO, future work still 
needs to address the fact that on large data sets 
this formulation may result in a number of binary 
variables that exceeds the available physical qubits. 
For that reason, options for training via repeated 
rounds of optimization (e.g. flavors of large neighborhood
search) need to be studied. By using suitable
graph embedding techniques, we also need to address 
the fact that not all quadratic interactions between 
QUBO variables have corresponding connections be- 
tween qubits on the physical device. Also, future work 
needs to investigate the asymptotic scaling of the time 
necessary for optimizing q-loss with AQO, similarly
to the way that was done for square loss in (Neven 
et al., 2009b). Finally, an interesting open question is 
whether the derivation in Section 3 can be extended 
to expressing a more general class of functions as QU- 
BOs. 
A. Explicit QUBO for training problem

Here we explicitly demonstrate the QUBO problem resulting from training with empirical risk over q-loss and Euclidean
regularization. To that end we use the variational approximation of q-loss and discretize the optimization
variables. 

Because the discretization itself is cumbersome and uninstructive, in Subsection A.1 we first expand all terms
with (mostly) continuous variables and show the general layout of the coefficient matrix Q for the QUBO problem 



$$\min_{\omega} \omega^T Q \omega = \min_{\omega} \left\{ \sum_{i<j} \omega_i \omega_j [Q_{i,j}] + \sum_{i} \omega_i [Q_{i,i}] \right\}, \qquad (16)$$

where in our case $\omega$ is a concatenation of the binary representations of the discretized $N$ weight variables
$w$, the bias $b$, and the $S$ variational parameters $t$. With the notation in (16) we adopt the convention of
distinguishing problem coefficients $Q_{*,*}$ by placing them inside square brackets. The preceding symbols are
always the corresponding variables.

Finally, in Subsection A.2 we replace the continuous variables $w$, $b$, $t$ respectively by their discretized versions
$\dot{w}$, $\dot{b}$, $\dot{t}$ according to bit depths $d_w$, $d_b$, $d_t$ and multiplier-offset pairs $(\alpha_w, \beta_w)$, $(\alpha_b, \beta_b)$, $(\alpha_t, \beta_t)$ that determine
the intervals in which the discrete variables take values.

A.1. Expansion with continuous variables

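Using the variational approximation for q-loss, the empirical risk expands as

$$\frac{1}{S} \sum_{s=1}^{S} L_q(m_s) = \frac{1}{S} \sum_{s=1}^{S} \min_{t_s} \left\{ m_s^2 - 2 m_s t_s + t_s^2 + (1 - q)^2 \, \frac{1 - \mathrm{sign}(t_s - 1)}{2} \right\}. \qquad (17)$$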



Now we expand the individual terms appearing on the right-hand side of (17). The goal is to distinguish the 
coefficients of the various terms in the optimization problem. Hence, in terminal expressions for each term we 
use the square brackets convention of (16). 



$$m_s^2 = (w^T x_s + b)^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j [x_{s,i} x_{s,j}] + bb[1] + b \sum_{i=1}^{N} w_i [2 x_{s,i}] \tag{18}$$

$$-2 m_s t_s = -2 y_s (w^T x_s + b) t_s = t_s \sum_{i=1}^{N} w_i [-2 y_s x_{s,i}] + b t_s [-2 y_s] \tag{19}$$

$$t_s^2 = t_s t_s [1] \tag{20}$$

$$(1 - q)^2 \, \frac{1 - \mathrm{sign}(t_s - 1)}{2} = (1 - t_{s,d_t})(1 - q)^2 = t_{s,d_t} [-(1 - q)^2] + (1 - q)^2 \tag{21}$$

The idea behind (21) is to use the most significant bit $t_{s,d_t}$ in the binary expansion of $t_s$ as an indicator of
$\mathrm{sign}(t_s - 1)$. We give more details on that in Subsection A.2.

The above individual expansions when summed over s become:



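$$\sum_{s=1}^{S} m_s^2 = \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \left[ \sum_{s=1}^{S} x_{s,i} x_{s,j} \right] + bb[S] + b \sum_{i=1}^{N} w_i \left[ 2 \sum_{s=1}^{S} x_{s,i} \right] \tag{22}$$

$$\sum_{s=1}^{S} -2 m_s t_s = \sum_{i=1}^{N} \sum_{s=1}^{S} w_i t_s [-2 y_s x_{s,i}] + b \sum_{s=1}^{S} t_s [-2 y_s] \tag{23}$$

$$\sum_{s=1}^{S} t_s^2 = \sum_{s=1}^{S} t_s t_s [1] \tag{24}$$

$$\sum_{s=1}^{S} (1 - q)^2 \, \frac{1 - \mathrm{sign}(t_s - 1)}{2} = \sum_{s=1}^{S} t_{s,d_t} [-(1 - q)^2] + S(1 - q)^2 \tag{25}$$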



Finally, we can write down the terminally expanded version of the training problem. Note that due to the
variational approximation of q-loss, we now have a joint optimization problem over the weight variables w, the
bias b, and the variational parameters t.



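$$
\begin{aligned}
(w, b, t)^* = \arg\min_{w, b, t} \Bigg\{
& \sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j \underbrace{\left[ \frac{1}{S} \sum_{s=1}^{S} x_{s,i} x_{s,j} \right]}_{\mathcal{A}_{i,j}}
+ bb \underbrace{[1]}_{\mathcal{B}}
+ b \sum_{i=1}^{N} w_i \underbrace{\left[ \frac{2}{S} \sum_{s=1}^{S} x_{s,i} \right]}_{\mathcal{C}_i} \\
& + \sum_{i=1}^{N} \sum_{s=1}^{S} w_i t_s \underbrace{\left[ \frac{-2 y_s x_{s,i}}{S} \right]}_{\mathcal{D}_{i,s}}
+ b \sum_{s=1}^{S} t_s \underbrace{\left[ \frac{-2 y_s}{S} \right]}_{\mathcal{E}_s}
+ \sum_{s=1}^{S} t_s t_s \underbrace{\left[ \frac{1}{S} \right]}_{\mathcal{F}_s} \\
& + \sum_{s=1}^{S} t_{s,d_t} \underbrace{\left[ \frac{-(1 - q)^2}{S} \right]}_{\mathcal{G}_s}
+ \sum_{i=1}^{N} w_i w_i \underbrace{[\lambda]}_{\mathcal{H}_i}
\Bigg\}
\end{aligned}
\tag{26}
$$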



In (26) we dropped the term $S(1 - q)^2$ coming from (25) because it only represents a constant offset. Fig. 5 shows
the overall layout of the coefficient matrix implied by the coefficient groups $\mathcal{A}$ to $\mathcal{H}$ distinguished from (26). Note
that (26) is still using the continuous variables (except for the bits $t_{s,d_t}$), but it is clear that after discretizing we
can obtain the final QUBO from it.
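The bookkeeping behind (26) can be sanity-checked numerically. The sketch below compares a direct evaluation of the regularized risk (in its variational form, with a bit standing in for the indicator of $t_s \geq 1$) against the sum of bracketed coefficient terms obtained from (18) to (25) scaled by $1/S$. The group names `A` through `H`, the regularizer written as `lam * ||w||^2`, and the toy data in the test are assumptions made for illustration, not values from the paper.

```python
def direct_objective(w, b, t, msb, X, y, q, lam):
    """Regularized risk in variational form; msb[s] plays the role of the
    indicator of t_s >= 1, so (1 - sign(t_s - 1)) / 2 equals 1 - msb[s]."""
    S = len(X)
    risk = 0.0
    for x_s, y_s, t_s, bit in zip(X, y, t, msb):
        m_s = y_s * (sum(w_i * x_i for w_i, x_i in zip(w, x_s)) + b)
        risk += (m_s - t_s) ** 2 + (1.0 - q) ** 2 * (1 - bit)
    return risk / S + lam * sum(w_i * w_i for w_i in w)


def expanded_objective(w, b, t, msb, X, y, q, lam):
    """Same quantity assembled from coefficient groups.
    Returns (value without the constant offset, constant offset)."""
    S, N = len(X), len(w)
    A = [[sum(X[s][i] * X[s][j] for s in range(S)) / S for j in range(N)]
         for i in range(N)]
    B = 1.0
    C = [2.0 * sum(X[s][i] for s in range(S)) / S for i in range(N)]
    D = [[-2.0 * y[s] * X[s][i] / S for s in range(S)] for i in range(N)]
    E = [-2.0 * y[s] / S for s in range(S)]
    F = [1.0 / S] * S
    G = [-((1.0 - q) ** 2) / S] * S
    H = [lam] * N  # assumed form of the Euclidean regularizer

    value = sum(w[i] * w[j] * A[i][j] for i in range(N) for j in range(N))
    value += b * b * B
    value += b * sum(w[i] * C[i] for i in range(N))
    value += sum(w[i] * t[s] * D[i][s] for i in range(N) for s in range(S))
    value += b * sum(t[s] * E[s] for s in range(S))
    value += sum(t[s] * t[s] * F[s] for s in range(S))
    value += sum(msb[s] * G[s] for s in range(S))
    value += sum(w[i] * w[i] * H[i] for i in range(N))
    constant = (1.0 - q) ** 2  # S * (1 - q)^2 scaled by 1 / S
    return value, constant
```

The returned constant is the offset that does not affect the minimizer, which is why it can be left out of the coefficient matrix.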

A.2. Binary variables

The final step for obtaining a QUBO is to discretize the continuous variables $w$, $b$, $t$ via binary expansions of
bit-depth $d_w$, $d_b$, $d_t$ respectively. We denote the discrete variables by $\hat{w}$, $\hat{b}$, $\hat{t}$. We also define multiplier-offset
pairs $(\alpha_w, \beta_w)$, $(\alpha_b, \beta_b)$, $(\alpha_t, \beta_t)$ that determine the intervals in which the discrete variables take values.

We apply discretizing transformations by binary variables $w_{i,k}$, $b_k$, $t_{s,k}$ and the shorthand function
$\delta_{\{w,b,t\}}(k) = 2^{k-1} / (2^{d_{\{w,b,t\}}} - 1)$:

$$w_i \to \hat{w}_i = \alpha_w \sum_{k=1}^{d_w} w_{i,k} \delta_w(k) + \beta_w \quad \text{for } i = 1, \ldots, N \tag{27}$$

$$b \to \hat{b} = \alpha_b \sum_{k=1}^{d_b} b_k \delta_b(k) + \beta_b \tag{28}$$

$$t_s \to \hat{t}_s = \alpha_t \sum_{k=1}^{d_t} t_{s,k} \delta_t(k) + \beta_t \quad \text{for } s = 1, \ldots, S \tag{29}$$

Figure 5. Overall layout of the coefficient matrix Q implied by (26) for the QUBO problem $\min_{\omega} \omega^T Q \omega$, where $\omega$ consists
of the concatenation of the binary representations of the discretized variables w, b, and t. White areas in the matrix
correspond to zero coefficients.



The intervals in which the discrete variables take values are:

$$\hat{w}_i \in [\beta_w; \alpha_w + \beta_w]$$

$$\hat{b} \in [\beta_b; \alpha_b + \beta_b]$$

As shown in (21) and (25), we take the most significant bit $t_{s,d_t}$ of each variable $\hat{t}_s$ as an indicator for $\mathrm{sign}(\hat{t}_s - 1)$.
Therefore we need to choose the interval in which the variables $\hat{t}$ take values such that the most significant bit
of each $\hat{t}_s$ is zero for $\hat{t}_s < 1$ and one otherwise. This leads us to intervals for which the upper half of the
representable values are greater than or equal to one, so we set $\beta_{\{w,b,t\}} = 1 - \alpha_{\{w,b,t\}} / 2$, which gives the intervals
in terms of $\alpha_{\{w,b,t\}}$ only:

$$\hat{w}_i \in \left[ 1 - \frac{\alpha_w}{2} ; 1 + \frac{\alpha_w}{2} \right] \tag{30}$$

$$\hat{b} \in \left[ 1 - \frac{\alpha_b}{2} ; 1 + \frac{\alpha_b}{2} \right] \tag{31}$$

$$\hat{t}_s \in \left[ 1 - \frac{\alpha_t}{2} ; 1 + \frac{\alpha_t}{2} \right] \tag{32}$$
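The role of the most significant bit can be illustrated with a short sketch. It decodes a bit string using the shorthand $\delta(k) = 2^{k-1}/(2^d - 1)$ and the offset $\beta = 1 - \alpha/2$, then lists every representable value together with its most significant bit. The function names, the bit depth of 3, and the multiplier $\alpha = 4$ are hypothetical choices for illustration.

```python
from itertools import product


def delta(k, depth):
    """Weight of bit k (1-indexed, k = depth is the most significant bit)."""
    return 2 ** (k - 1) / (2 ** depth - 1)


def decode(bits, alpha, beta):
    """Map a bit tuple to alpha * sum_k bits[k] * delta(k) + beta.
    bits[0] is the least significant bit, bits[-1] the most significant."""
    depth = len(bits)
    return alpha * sum(bit * delta(k, depth)
                       for k, bit in enumerate(bits, start=1)) + beta


def decode_centered_at_one(bits, alpha):
    """Decode with the offset beta = 1 - alpha / 2."""
    return decode(bits, alpha, 1.0 - alpha / 2.0)


def all_values(depth, alpha):
    """List (bits, value) pairs for every bit pattern of the given depth."""
    return [(bits, decode_centered_at_one(bits, alpha))
            for bits in product((0, 1), repeat=depth)]


if __name__ == "__main__":
    for bits, value in sorted(all_values(3, 4.0), key=lambda pair: pair[1]):
        print(bits, round(value, 4), "MSB =", bits[-1])
```

With this offset, the largest value reachable with the most significant bit cleared is $1 - \alpha / (2(2^d - 1))$, which is below one, while setting that bit alone already gives a value above one, so the bit separates the two regimes.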



Now we convert the various terms in (26) from continuous to binary variables, which gives the coefficients of the
final QUBO problem:

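$$
\begin{aligned}
\sum_{i=1}^{N} \sum_{j=1}^{N} w_i w_j [\mathcal{A}_{i,j}]
\to{} & \sum_{i=1}^{N} \sum_{j=1}^{N} \left( \alpha_w \sum_{k=1}^{d_w} w_{i,k} \delta_w(k) + \beta_w \right) \left( \alpha_w \sum_{k'=1}^{d_w} w_{j,k'} \delta_w(k') + \beta_w \right) (\mathcal{A}_{i,j}) \\
={} & \sum_{i=1}^{N} \sum_{k=1}^{d_w} \sum_{j=1}^{N} \sum_{k'=1}^{d_w} w_{i,k} w_{j,k'} \left[ \alpha_w^2 \delta_w(k) \delta_w(k') \mathcal{A}_{i,j} \right]
+ \sum_{i=1}^{N} \sum_{k=1}^{d_w} w_{i,k} \left[ 2 \alpha_w \beta_w \delta_w(k) \sum_{j=1}^{N} \mathcal{A}_{i,j} \right]
+ \beta_w^2 \sum_{i=1}^{N} \sum_{j=1}^{N} \mathcal{A}_{i,j}
\end{aligned}
$$

where the linear term uses the symmetry $\mathcal{A}_{i,j} = \mathcal{A}_{j,i}$.

$$
\begin{aligned}
bb[\mathcal{B}]
\to{} & \left( \alpha_b \sum_{k=1}^{d_b} b_k \delta_b(k) + \beta_b \right) \left( \alpha_b \sum_{k'=1}^{d_b} b_{k'} \delta_b(k') + \beta_b \right) (\mathcal{B}) \\
={} & \sum_{k=1}^{d_b} \sum_{k'=1}^{d_b} b_k b_{k'} \left[ \alpha_b^2 \delta_b(k) \delta_b(k') \mathcal{B} \right]
+ \sum_{k=1}^{d_b} b_k \left[ 2 \alpha_b \beta_b \delta_b(k) \mathcal{B} \right]
+ \beta_b^2 \mathcal{B}
\end{aligned}
$$

$$
\begin{aligned}
b \sum_{i=1}^{N} w_i [\mathcal{C}_i]
\to{} & \left( \alpha_b \sum_{k=1}^{d_b} b_k \delta_b(k) + \beta_b \right) \sum_{i=1}^{N} \left( \alpha_w \sum_{k'=1}^{d_w} w_{i,k'} \delta_w(k') + \beta_w \right) (\mathcal{C}_i) \\
={} & \sum_{i=1}^{N} \sum_{k=1}^{d_b} \sum_{k'=1}^{d_w} b_k w_{i,k'} \left[ \alpha_b \alpha_w \delta_b(k) \delta_w(k') \mathcal{C}_i \right]
+ \sum_{i=1}^{N} \sum_{k'=1}^{d_w} w_{i,k'} \left[ \alpha_w \beta_b \delta_w(k') \mathcal{C}_i \right] \\
& + \sum_{k=1}^{d_b} b_k \left[ \alpha_b \beta_w \delta_b(k) \sum_{i=1}^{N} \mathcal{C}_i \right]
+ \beta_b \beta_w \sum_{i=1}^{N} \mathcal{C}_i
\end{aligned}
$$

$$
\begin{aligned}
\sum_{i=1}^{N} \sum_{s=1}^{S} w_i t_s [\mathcal{D}_{i,s}]
\to{} & \sum_{i=1}^{N} \sum_{s=1}^{S} \left( \alpha_w \sum_{k=1}^{d_w} w_{i,k} \delta_w(k) + \beta_w \right) \left( \alpha_t \sum_{k'=1}^{d_t} t_{s,k'} \delta_t(k') + \beta_t \right) (\mathcal{D}_{i,s}) \\
={} & \sum_{i=1}^{N} \sum_{s=1}^{S} \sum_{k=1}^{d_w} \sum_{k'=1}^{d_t} w_{i,k} t_{s,k'} \left[ \alpha_w \alpha_t \delta_w(k) \delta_t(k') \mathcal{D}_{i,s} \right]
+ \sum_{i=1}^{N} \sum_{k=1}^{d_w} w_{i,k} \left[ \alpha_w \beta_t \delta_w(k) \sum_{s=1}^{S} \mathcal{D}_{i,s} \right] \\
& + \sum_{s=1}^{S} \sum_{k'=1}^{d_t} t_{s,k'} \left[ \alpha_t \beta_w \delta_t(k') \sum_{i=1}^{N} \mathcal{D}_{i,s} \right]
+ \beta_w \beta_t \sum_{i=1}^{N} \sum_{s=1}^{S} \mathcal{D}_{i,s}
\end{aligned}
$$

$$
\begin{aligned}
b \sum_{s=1}^{S} t_s [\mathcal{E}_s]
\to{} & \left( \alpha_b \sum_{k=1}^{d_b} b_k \delta_b(k) + \beta_b \right) \sum_{s=1}^{S} \left( \alpha_t \sum_{k'=1}^{d_t} t_{s,k'} \delta_t(k') + \beta_t \right) (\mathcal{E}_s) \\
={} & \sum_{s=1}^{S} \sum_{k=1}^{d_b} \sum_{k'=1}^{d_t} b_k t_{s,k'} \left[ \alpha_b \alpha_t \delta_b(k) \delta_t(k') \mathcal{E}_s \right]
+ \sum_{k=1}^{d_b} b_k \left[ \alpha_b \beta_t \delta_b(k) \sum_{s=1}^{S} \mathcal{E}_s \right] \\
& + \sum_{s=1}^{S} \sum_{k'=1}^{d_t} t_{s,k'} \left[ \alpha_t \beta_b \delta_t(k') \mathcal{E}_s \right]
+ \beta_b \beta_t \sum_{s=1}^{S} \mathcal{E}_s
\end{aligned}
$$

$$
\begin{aligned}
\sum_{s=1}^{S} t_s t_s [\mathcal{F}_s]
\to{} & \sum_{s=1}^{S} \left( \alpha_t \sum_{k=1}^{d_t} t_{s,k} \delta_t(k) + \beta_t \right) \left( \alpha_t \sum_{k'=1}^{d_t} t_{s,k'} \delta_t(k') + \beta_t \right) (\mathcal{F}_s) \\
={} & \sum_{s=1}^{S} \sum_{k=1}^{d_t} \sum_{k'=1}^{d_t} t_{s,k} t_{s,k'} \left[ \alpha_t^2 \delta_t(k) \delta_t(k') \mathcal{F}_s \right]
+ \sum_{s=1}^{S} \sum_{k=1}^{d_t} t_{s,k} \left[ 2 \alpha_t \beta_t \delta_t(k) \mathcal{F}_s \right]
+ \beta_t^2 \sum_{s=1}^{S} \mathcal{F}_s
\end{aligned}
$$

$$
\begin{aligned}
\sum_{i=1}^{N} w_i w_i [\mathcal{H}_i]
\to{} & \sum_{i=1}^{N} \left( \alpha_w \sum_{k=1}^{d_w} w_{i,k} \delta_w(k) + \beta_w \right) \left( \alpha_w \sum_{k'=1}^{d_w} w_{i,k'} \delta_w(k') + \beta_w \right) (\mathcal{H}_i) \\
={} & \sum_{i=1}^{N} \sum_{k=1}^{d_w} \sum_{k'=1}^{d_w} w_{i,k} w_{i,k'} \left[ \alpha_w^2 \delta_w(k) \delta_w(k') \mathcal{H}_i \right]
+ \sum_{i=1}^{N} \sum_{k=1}^{d_w} w_{i,k} \left[ 2 \alpha_w \beta_w \delta_w(k) \mathcal{H}_i \right]
+ \beta_w^2 \sum_{i=1}^{N} \mathcal{H}_i
\end{aligned}
$$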
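Each conversion above follows the same pattern: substitute the binary expansion, multiply out, and collect quadratic, linear, and constant coefficients. The sketch below does this for a single squared variable with one coefficient (the pattern of the bias term) and checks the collected coefficients against direct evaluation over all bit patterns. Function names and the numeric values used in the check are hypothetical.

```python
from itertools import product


def delta(k, depth):
    """Weight of bit k (1-indexed) in a binary expansion of the given depth."""
    return 2 ** (k - 1) / (2 ** depth - 1)


def expand_square_term(coeff, alpha, beta, depth):
    """Coefficients of coeff * (alpha * sum_k bit_k * delta(k) + beta)^2,
    grouped as quadratic (bit pairs), linear (single bits), and constant."""
    ks = range(1, depth + 1)
    quadratic = {(k, kp): alpha ** 2 * delta(k, depth) * delta(kp, depth) * coeff
                 for k in ks for kp in ks}
    linear = {k: 2.0 * alpha * beta * delta(k, depth) * coeff for k in ks}
    constant = beta ** 2 * coeff
    return quadratic, linear, constant


def evaluate(bits, quadratic, linear, constant):
    """Evaluate the collected coefficients on a bit tuple (bits[0] is k = 1)."""
    value = constant
    for (k, kp), c in quadratic.items():
        value += c * bits[k - 1] * bits[kp - 1]
    for k, c in linear.items():
        value += c * bits[k - 1]
    return value


def max_expansion_error(coeff, alpha, beta, depth):
    """Largest gap between direct evaluation and the expanded coefficients."""
    quadratic, linear, constant = expand_square_term(coeff, alpha, beta, depth)
    worst = 0.0
    for bits in product((0, 1), repeat=depth):
        decoded = alpha * sum(bit * delta(k, depth)
                              for k, bit in enumerate(bits, start=1)) + beta
        direct = coeff * decoded ** 2
        worst = max(worst, abs(direct - evaluate(bits, quadratic, linear, constant)))
    return worst
```

Quadratic entries with $k = k'$ multiply a bit by itself and therefore contribute to the diagonal of the coefficient matrix, alongside the linear coefficients.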



B. Data summary 



Table 1. Summary of data sets

Name            Dims   #Examples   Density (%)   Baseline error (%)   d_w
Long-Servedio     21        2000        100.00                   --     2
Mease-Wyner       20        2000        100.00                49.80     2
covertype         54      581012         22.20                36.46     4
mushrooms        112        8124         18.75                48.20     4
adult9           123       48842         11.30                23.93     4
web8             300       59245          4.20                 2.92     4



C. q values for q-loss and t values for t-logistic



Table 2. Approximate lower bounds for q in q-loss computed according to Section 5. In each case the q values offered to
cross-validation are taken as the 10 equally spaced values between the bound and (both inclusive).

                                     Label noise (%)
Data set name              0          10          20          30          40
Long-Servedio          -1000   -3.486401   -2.172365   -1.590225   -1.243201
Mease-Wyner       -25.726124   -3.333979   -2.084934   -1.524450   -1.188680
covertype          -1.133948   -0.979198   -0.853870   -0.749685   -0.661297
mushrooms         -69.710678   -3.400772   -2.114833   -1.544074   -1.203589
web8               -8.901475   -2.081794   -1.233931   -0.839671   -0.600122


Table 3. q values for q-loss picked by cross-validation

                        Label noise (%)
Data set name        0      10      20      30      40
Mease-Wyner         --   -2.96   -1.62   -1.36      --
covertype           --   -0.54   -0.38    -0.5   -0.51
mushrooms           --   -0.76   -0.47   -0.17   -0.13
adult9           -0.86   -0.53   -0.43   -0.53   -0.07
web8             -0.99   -0.46   -0.41   -0.19      --


Table 4. t values for t-logistic picked by cross-validation

                     Label noise (%)
Data set name      0    10    20    30    40
Mease-Wyner      1.1   1.6   2.0   1.9   1.9
covertype         --   1.2   1.5   1.3   1.2
mushrooms        1.1   1.1   1.2   1.1   1.1
adult9           1.2   1.3   2.0   1.2   2.0
web8             1.1   1.1   1.1   1.2   1.1


D. Regularization strength 



Table 5. λ values offered to cross-validation (C for liblinear is 1/λ)

λ



2.000090 
0.398965 
0.079583 
0.015875 
0.003167 
0.000632 
0.000126 
0.000025 
0.000005 
0.000001 
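The ten values in Table 5 are close to a geometric sequence between the largest and smallest entries. The sketch below regenerates such a sequence and the corresponding C = 1/λ values; treating the grid as exactly geometric is an assumption made here for illustration, and the function names are hypothetical.

```python
# Values copied from Table 5 (six decimal places, as printed).
TABLE_5_LAMBDAS = [2.000090, 0.398965, 0.079583, 0.015875, 0.003167,
                   0.000632, 0.000126, 0.000025, 0.000005, 0.000001]


def lambda_grid(lam_max=2.000090, lam_min=0.000001, count=10):
    """Geometric sequence from lam_max down to lam_min with `count` entries."""
    ratio = (lam_min / lam_max) ** (1.0 / (count - 1))
    return [lam_max * ratio ** k for k in range(count)]


def liblinear_c(lam):
    """Regularization constant C used by liblinear, taken as 1 / lambda."""
    return 1.0 / lam


if __name__ == "__main__":
    for generated, printed in zip(lambda_grid(), TABLE_5_LAMBDAS):
        print(f"{generated:.6f}  {printed:.6f}  C = {liblinear_c(printed):.6f}")
```

Consecutive entries differ by a factor of roughly 5, so the grid spans about six orders of magnitude with only ten candidates.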





Table 6. C values for liblinear picked by cross-validation

                                              Label noise (%)
Data set name               0          10               20               30          40
Long-Servedio        0.499978    2.506486         0.499978         0.499978    0.499978
Mease-Wyner      40000.000000    0.499978       315.756236        12.565498   62.992126
covertype            0.499978    2.506486        62.992126   1000000.000000   12.565498
mushrooms            2.506486   12.565498         0.499978         0.499978    0.499978
adult9               0.499978   62.992126         0.499978         0.499978    0.499978
web8               315.756236    0.499978        12.565498        12.565498    0.499978



Table 7. λ values picked by cross-validation for 0% label noise

                                                 Method
Data set name           q   logistic     square   t-logistic    sigmoid     probit   smooth hinge
Long-Servedio    0.015875   0.003167   0.079583     0.003167   0.000632   0.003167       0.015875
Mease-Wyner      0.000126   0.000001   0.000025     0.000001   0.000025   0.003167       0.000001
covertype        0.000025   0.000025   0.000025     0.000001   2.000090   2.000090       0.000001
mushrooms        0.000025   0.000001   0.000025     0.000126   0.000632   0.015875       0.000632
web8             0.000632   0.000001   0.000005     0.000005   0.000126   0.000632       0.000001



Robust Classification with Adiabatic Quantum Optimization 





Table 8. λ values picked by cross-validation for 10% label noise

                                                 Method
Data set name           q   logistic     square   t-logistic    sigmoid     probit   smooth hinge
Long-Servedio    0.015875   0.000005   2.000090     0.000126   0.000632   0.003167       0.003167
Mease-Wyner      0.000126   0.000005   0.000632     0.000001   0.000632   0.003167       0.000005
covertype        0.000001   0.000025   0.000632     0.000001   0.000632   0.003167       0.000126
mushrooms        0.003167   0.000005   0.000001     0.000001   0.000632   0.003167       0.000005
adult9           0.015875   0.000632   0.003167     0.000632   0.003167   0.003167       0.000126
web8             0.000632   0.000005   0.000126     0.000001   0.000126   0.000632       0.000005




Table 9. λ values picked by cross-validation for 20% label noise

                                                 Method
Data set name           q   logistic     square   t-logistic    sigmoid     probit   smooth hinge
Long-Servedio    0.000126   2.000090   2.000090     0.000025   0.000632   0.003167       2.000090
Mease-Wyner      0.000126   0.000025   0.000005     0.000001   0.003167   0.003167       0.000126
covertype        0.000001   0.000001   0.000126     0.000001   0.000025   0.000632       0.000126
mushrooms        0.003167   0.000126   0.000632     0.000126   0.000632   0.003167       0.000025
adult9           0.015875   0.079583   0.079583     0.003167   0.000632   0.003167       0.003167
web8             0.000632   0.000001   0.000001     0.000005   2.000090   0.000632       0.000126




Table 10. λ values picked by cross-validation for 30% label noise

                                                 Method
Data set name           q   logistic     square   t-logistic    sigmoid     probit   smooth hinge
Long-Servedio    0.003167   2.000090   2.000090     0.000001   0.000126   0.003167       2.000090
Mease-Wyner      0.000126   0.000005   0.000001     0.000001   0.003167   0.003167       0.000005
covertype        0.000025   0.000001   0.000126     0.000001   0.000632   0.003167       0.000025
mushrooms        0.003167   0.000632   0.003167     0.000632   0.003167   0.003167       0.000632
adult9           0.003167   2.000090   0.003167     2.000090   0.000126   0.000632       2.000090
web8             0.000632   0.000126   0.000001     0.000632   0.000632   0.003167       0.000126



Table 11. λ values picked by cross-validation for 40% label noise

                                                 Method
Data set name           q   logistic     square   t-logistic    sigmoid     probit   smooth hinge
Long-Servedio    0.003167   2.000090   2.000090     0.000001   0.000126   0.000632       2.000090
Mease-Wyner      0.000126   0.000001   0.000005     0.000001   2.000090   0.003167       0.000001
covertype        0.000001   0.000001   0.000001     0.000001   2.000090   2.000090       0.000001
mushrooms        0.003167   0.000126   0.000632     0.000001   0.003167   0.015875       0.003167
web8             0.000632   0.015875   0.079583     0.015875   0.000632   0.000632       0.000632

E. Statistical significance



Table 12. Paired t-test for statistical significance of the test error difference yielded by q-loss and an estimated closest
competitor. The closest competitor is manually chosen on a per-data-set basis from the set of convex losses. We exclude
the other non-convex losses (t-logistic, sigmoid, and probit) from this comparison as we do not know what their performance
would be if they could be reliably solved to optimality. We reject the null hypothesis at the $\alpha$ = 5% significance level. 'Y'
means that the difference is significant and 'N' means the difference is not significant.

                                                 Label noise (%)
Data set name    Compared losses               0    10    20    30    40
Long-Servedio    smoothed hinge vs q-loss      N     N     Y     Y     Y
Mease-Wyner      smoothed hinge vs q-loss      Y     Y     Y     Y     Y
covertype        liblinear vs q-loss           Y     Y     Y     Y     Y
mushrooms        smoothed hinge vs q-loss      N     N     N     N     Y
adult9           smoothed hinge vs q-loss      Y     Y     Y     Y     Y
web8             smoothed hinge vs q-loss      Y     Y     Y     Y     Y
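For readers who want to reproduce the kind of comparison summarized in Table 12, a paired t-test on matched test errors can be sketched as follows. The number of paired runs, the error values used in the example, and the function names are all hypothetical and not taken from the paper; the critical value 2.262 is the standard two-sided 5% threshold for 9 degrees of freedom and must be replaced if the number of pairs differs.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(errors_a, errors_b):
    """Return (t statistic, degrees of freedom) for paired samples."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # stdev uses n - 1
    return mean(diffs) / standard_error, n - 1


def is_significant(errors_a, errors_b, critical_value):
    """Two-sided decision: 'Y' if |t| exceeds the critical value, else 'N'."""
    t_stat, _ = paired_t_statistic(errors_a, errors_b)
    return "Y" if abs(t_stat) > critical_value else "N"


if __name__ == "__main__":
    # Made-up test errors for two methods over ten matched runs.
    competitor = [0.22, 0.25, 0.20, 0.23, 0.26, 0.22, 0.24, 0.22, 0.22, 0.24]
    q_loss_err = [0.20, 0.22, 0.19, 0.21, 0.23, 0.20, 0.22, 0.21, 0.19, 0.22]
    print(paired_t_statistic(competitor, q_loss_err))
    print(is_significant(competitor, q_loss_err, critical_value=2.262))
```

Pairing matters because both methods are evaluated on the same splits, so the test works on per-split differences and is not distorted by variation between splits.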



